Abstract In many classification systems, sensing modalities have different acquisition costs.
It is often unnecessary to use every modality to classify a majority of examples. We study 
a multi-stage system in a prediction time cost reduction setting, where the full data is avail- 
able for training, but for a test example, measurements in a new modality can be acquired 
at each stage for an additional cost. We seek decision rules to reduce the average measure- 
ment acquisition cost. We formulate an empirical risk minimization problem (ERM) for a 
multi-stage reject classifier, wherein the stage k classifier either classifies a sample using 
only the measurements acquired so far or rejects it to the next stage where more attributes 
can be acquired for a cost. To solve the ERM problem, we show that the optimal reject clas- 
sifier at each stage is a combination of two binary classifiers, one biased towards positive 
examples and the other biased towards negative examples. We use this parameterization to 
construct stage-by-stage global surrogate risk, develop an iterative algorithm in the boosting 
framework and present convergence and generalization results. We test our work on syn- 
thetic, medical and explosives detection datasets. Our results demonstrate that substantial 
cost reduction without a significant sacrifice in accuracy is achievable. 
1 Introduction

In many applications including homeland security and medical diagnosis, decision systems 
are composed of an ordered sequence of stages. Each stage is associated with a sensor or a 
physical sensing modality. Typically, a less informative sensor is cheap (or fast) while a more 
informative sensor is either expensive or requires more time to acquire a measurement. In 
practice, a measurement budget (or throughput constraint) does not allow all the modalities 
to be used simultaneously in making decisions. The goal in these scenarios is to attempt to 
classify examples with low cost sensors and limit the number of examples for which a more
expensive or time-consuming informative sensor is required.
Fig. 1 Multi-Stage System consists of K stages. Each stage is a binary classifier with a reject option. The
system incurs a penalty of $\delta^{k+1}$ at the kth stage if it rejects to seek more measurements. The kth classifier only
sees the first k sensing modalities in making a decision.



For example, in explosives detection, in the first stage, an infrared imager or a metal 
detector can be used with high throughput and low cost. A second stage could be the use
of a slower, more expensive active millimeter wave (AMMW) scanner. The final third stage 
is a time consuming human inspection. In medical applications, first stages are typically 
non-invasive procedures (such as a physical exam) followed by more expensive tests (blood 
test, CT scan, etc.), and the final stages are invasive (surgical) procedures.

Many such examples share a common structure (see Fig. 1), and we list some of its
salient aspects below: 



(A) Sensors & Ordered Stages: Each stage is associated with a new sensor measurement or 
a sensing modality. Multiple stages are an ordered sequence of sensors or sensor modalities 
with later stages corresponding to expensive or time-consuming measurements. In many 
situations, there is often some flexibility in choosing a sensing modality from a collection 
of possible modalities. In these cases, the optimal choice of sensing actions also becomes 
an issue. While our methodology can be modified to account for this more general setting, 
we primarily consider a fixed order of stages and sensing modalities in this paper. This is 
justified on account of the fact that many of the situations we have come across consist of 
a handful of sensors or sensing modalities. Consequently, for these situations, a separate
treatment of the sensor-ordering problem is not needed, since one could enumerate the
possible orderings by brute force and optimize over them.
(B) Reject Classifiers: Our sequential decision rules either attempt to fully classify an
instance at each stage or "reject" the instance on to the next stage for more measurements in
case of ambiguity. For example, in explosives detection, a decision rule in the first stage, 
based on IR scan, would attempt to detect whether or not a person is a threat and identify the 
explosive type/location in case of a threat. If the person is identified as a threat at the first 
stage it is unnecessary (and indeed dangerous - the explosive could be detonated) to seek 
more information. Similarly in medical diagnosis if a disease is diagnosed at an early stage, 
it makes sense to begin early treatment rather than waiting for more conclusive tests. 

(C) Information vs. Computation: Note that our setup can only use the partial measure- 
ments acquired up to a stage in making a decision. In other methods, such as detection 
cascades (Viola and Jones, 2001), the full measurement and therefore all the information
is available to every stage. Therefore, any region in the feature space can be carved out with 
more complex regions in the measurement space, or equivalently complex features can be 
extracted but with higher costs. In contrast, we have only partial measurements (or infor- 
mation) and so any feature or classifier that we employ has to be agnostic to unavailable 
measurements at that stage. 

The two stage example in Fig. 2 illustrates some of the advantages of our scheme over
the alternative scheme that first acquires measurements from all the sensing modalities, 
which we refer to as the centralized classifier. A reject classifier utilizes the 2nd stage sensor 
only for a fraction of the data but achieves the same performance as the centralized classifier. 




[Figure 2; axis labels: 1st sensor, Cost]



Fig. 2 Advantage of a 2 stage classifier: 10 samples, binary (squares, circles). The red line is the optimal
decision when using only the 1st stage modality. The blue line is optimal if using both (2nd stage). The curve
is classification error vs. samples rejected (cost). The red point corresponds to classifying everything at stage
1. The blue point corresponds to rejecting everything and classifying using both modalities (stage 2). The green
point is a partial reject strategy. The samples outside the green region are classified using only the first modality,
and samples inside the region are rejected to stage 2 and are classified using both modalities. Note that blue
and green have the same error, while the reject strategy (green) has to use the 2nd stage sensor only for half of
the examples, reducing the cost by a factor of 2.

Our approach is based on the so-called Prediction Time Cost Reduction approach (Kanani and Melville, 2008).
Specifically, we assume a set of training examples in which measurements from all the
sensors or sensing modalities as well as the ground truth labels are available. Our goal is to
derive sequential reject classifiers that reduce the cost of measurement acquisition and the error in
the prediction (or testing) phase.
We show that this sequential reject classifier problem can be formulated as an instance of
a partially observable Markov Decision Process (POMDP) (Kaelbling et al., 1998) when
the class-specific probability models for the different sensor measurements are known. In 
this case the optimal sequential classifier can be cast as a solution to a Dynamic Program 
(DP). The DP solution is a sequence of stage-wise optimization problems, where each stage 
problem is a combination of the cost from the current stage and the cost-to-go function that 
is carried on from later stages. 

Nevertheless, class probability models are typically unknown; our scenarios produce 
high-dimensional sensor data (such as images). Consequently, unlike some of the conventional
approaches (Ji and Carin, 2007), where probability models are first estimated to
solve POMDPs, we have to adopt a non-parametric discriminative learning approach. We 
utilize the structure of the POMDP solution to empirically approximate the value of the 
cost-to-go function only at a discrete subset of the data-space. Next, instead of interpolat- 
ing or parameterizing the cost-to-go function and learning it from data, we formulate an 
empirical discriminative objective that utilizes point-wise cost-to-go estimates evaluated on 
the training set and directly learn classifiers that minimize this objective. Using this
decomposition, we formulate a novel multi-stage empirical risk minimization (ERM) problem. We
solve this ERM problem at each stage by first factoring the cost function into classification 
and rejection decisions. Then we transform reject decisions into a binary classification prob- 
lem. Specifically, we show that the optimal reject classifier at each stage is a combination 
of two binary classifiers, one biased towards positive examples and the other biased towards 
negative examples. The disagreement region of the two then defines the reject region. 

We then approximate this empirical risk with a global surrogate. We present an
iterative solution and demonstrate local convergence properties. The solution is obtained in
a boosting framework. We then extend well-known margin-based generalization bounds
(Bartlett et al., 1998) to this multi-stage setting. We tested our methods on synthetic,
medical and explosives datasets. Our results demonstrate an advantage of the multi-stage classifier:
cost reduction without a significant sacrifice in accuracy.

1.1 Related Work 

Active Feature Acquisition (AFA): The subject of this paper is not new and has been
studied in the Machine Learning community as early as MacKay (1992). Our work is
closely related to the so-called prediction time active feature acquisition (AFA) approach
in the area of cost-sensitive learning. The goal there is to make sequential decisions of
whether or not to acquire a new feature to improve prediction accuracy. A natural approach
is to formalize the problem as a POMDP. Ji and Carin (2007) and Kapoor and Horvitz (2009)
model the decision process and infer feature dependencies while taking acquisition costs
into account. Sheng and Ling (2006), Bilgic and Getoor (2007), and Zubek and Dietterich (2002)
study strategies for optimizing decision trees while minimizing acquisition costs. The
construction is usually based on some purity metric such as entropy. Kanani and Melville (2008)
propose a method that acquires an attribute if it increases an expected utility. However, all
these methods require estimating the likelihood that a certain feature value occurs
given the features collected so far. While surrogates based on classifiers or regressors can
be employed to estimate likelihoods, this approach requires discrete, binary or quantized
attributes. In contrast, our problem domain deals with high dimensional measurements
(images consisting of millions of pixels), so we develop a discriminative learning approach and
formulate a multi-stage empirical risk optimization problem to reduce measurement costs
and misclassification errors. At each stage, we solve the reject classification problem by
factorizing the cost function into classification and rejection decisions. We then embed the
rejection decision into a binary classification problem.

Single Stage Reject Classifiers: Our paper is also closely related to the topic of reject
classifiers, which has also been investigated. However, in the literature reject classifiers
have been primarily considered in a single stage scenario. In the Bayesian framework,
Chow (1970) introduced Chow's rule for classification. It states that given an observation
$x$, a reject cost $\delta$ and $J$ classes, reject $x$ if the maximum of the class posteriors
is less than one minus the reject cost: $\max_{j=1,\dots,J} P(y = j \mid x) < 1 - \delta$. In the context of machine learning,
the posterior distributions are not known, and a decision rule is estimated directly. One
popular approach is to reject examples with a small margin. Specifically, in the context of
support vector machine classifiers, Yuan and Casasent (2003), Bartlett and Wegkamp (2008),
Rodriguez-Diaz and Castanon (2009), and Grandvalet et al. (2008) define a reject region to lie
within a small distance (margin) to the separating hyperplane and embed this in the hinge
loss of the SVM formulation. El-Yaniv and Wiener (2011) propose a reject criterion
motivated by active learning but its implementation turns out to be computationally impractical.
In contrast, we consider multiple stages of reject classifiers. We assume an error-prone
second stage, which occurs in such fields as threat detection and medical imaging. In this
scenario, rejecting in the margin is not always meaningful. Fig. 3 illustrates that thresholding
the margin to reject can lead to significant degradation. This usually happens when stage
measurements are complementary; then examples within a small margin of the 1st stage
boundary may not be meaningful to reject. Multiple stages of margin-based reject classifiers
have been considered by Liu et al. (2008) using SVMs in image classification. The method
does not take into account the cost of later stages and is similar to the myopic method that
we compare in the Experiments section.

Detection Cascades: Our multi-stage sequential reject classifiers bear close resemblance
to detection cascades. There is much literature on cascade design (see Zhang and Zhang, 2010;
Chen et al., 2012 and references therein) but most cascades roughly follow the set-up
introduced by Viola and Jones (2001) to reduce computation cost during classification. At each
stage in a cascade, there is a binary classifier with a very high detection rate and a mediocre 
false alarm rate. Each stage makes a partial decision; it either detects an instance as negative 
or passes it on to the next stage. Only the last stage in the cascade makes a full decision, 
namely, whether the example belongs to a positive or negative class. 

There are several fundamental differences between detection cascades and the multi- 
stage reject classifiers (MSRC). A key difference is the system architecture. Detection cas- 
cades are primarily concerned with binary classification problems. They make partial deci- 
sions, delaying a positive decision until the final stage. In contrast, MSRCs can make full 
classification decisions at any stage. Conceptually, this distinction requires a fundamentally 
new approach; detection cascades work because their focus is on unbalanced problems with 
few positives and a large number of negatives; and so the goal at each stage is to admit large 
false positives with negligible missed detections. Consequently, each stage can be associ- 
ated with a binary classification problem that is acutely sensitive to missed detections. In 
contrast, our scheme at each stage is a composite scheme composed of a classifier as well 
as a rejection decision. The rejection decision is itself a binary classification problem. In 
practice, MSRCs arise in important areas such as medical diagnosis and explosives
detection, as we argued in Sec. 1, item (B). As a performance metric, detection cascades trade off
missed detections at the final stage against average computation. MSRCs trade off average
misclassification errors against the number of examples that reach later stages (i.e., require
more sensors or sensing modalities). For these reasons it is difficult to directly compare
algorithms developed for MSRCs to those developed for detection cascades. Nevertheless, our
goals and resulting algorithms are similar to some of the issues that arise in cascade design
(see Chen et al., 2012 and references therein), namely, performing a joint optimization for all
the stages in a cascade given a cost structure for different features.



Other Cost Sensitive Methods: Network intrusion detection systems (IDS) are an area
where sequential decision systems have been explored (see Fan et al., 2000; Lee et al., 2002;
Cordelia and Sansone, 2007). In IDS, features have different computation costs. For each
cost level, a ruleset is learned. The goal is to use as many low cost rules as possible. In a
related set-up, Fan et al. (2002) and Wang et al. (2003) consider a more general ensemble of base
classifiers and explore how to minimize the ensemble size without sacrificing performance.
In the test phase, for a sample, another classifier is added to the ensemble if the confidence
of the current classification is low. Here, similar to detection cascades, the goal is to reduce
computation time. As we described in Sec 1, item (C), the important distinction is that, in 
our setting, a decision is based only on the partial information acquired up to a stage. In 
a computation driven method, a stage (or base classifier) decides using a feature computed 
from the full measurement vector. 




Fig. 3 (a) Gaussian Mixture (binary). (b) Error rate vs. reject rate on complementary measurements. The 1st
stage uses only dim. 1; the 2nd stage uses only dim. 2. The myopic strategy (green) thresholds the margin of the
classifier; our method is the global surrogate; the Bayesian classifier gives the best performance. Thresholding
the margin performs significantly worse than our method.



2 Problem Statement 

Let $(x, y) \in \mathcal{X} \times \{1, 2, \dots, C\}$ be distributed according to an unknown distribution $\mathcal{D}$. A data
point has $K$ features, $x = \{x_1, x_2, \dots, x_K\}$, and belongs to one of $C$ classes indicated by its
label $y$. The $k$th feature is extracted from a measurement acquired at the $k$th stage. We define a
truncated feature vector at the $k$th stage: $x^k = \{x_1, x_2, \dots, x_k\}$. Let $\mathcal{X}^k$ be the space of the first $k$
features such that $x^k \in \mathcal{X}^k$.

The system has $K$ stages, the order of the stages is fixed, and the $k$th stage acquires the $k$th
measurement. At each stage $k$, there is a decision with a reject option, $f^k$. It can either
classify an example, $f^k(x^k) : \mathcal{X}^k \to \{1, 2, \dots, C\}$, or delay the decision until the next stage,
$f^k(x^k) = r$, and incur a penalty of $\delta^{k+1}$. Here, $r$ indicates the "reject" decision. $f^k$ has to
make a decision using only the first $k$ sensing modalities. The last stage $K$ is terminal, a
standard classifier. Define the system risk to be,

$$R(f^1, \dots, f^K, x, y) = \sum_{k=1}^{K} S^k(x^k) R_k(f^k, x^k, y) \tag{1}$$

Here, $R_k$ is the cost of classifying at the $k$th stage, and $S^k(x^k) \in \{0, 1\}$ is the binary state variable
indicating whether $x$ has been rejected up to the $k$th stage.



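$$R_k(f^k, x^k, y) = \begin{cases} \delta^{k+1}, & f^k(x^k) = r \\ 1, & f^k(x^k) \neq y \wedge f^k(x^k) \neq r \\ 0, & \text{otherwise} \end{cases}$$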



If $x$ is active and is misclassified, the penalty is 1.¹ If it is rejected then the system incurs a
penalty of $\delta^{k+1}$, and the state variable for that example remains at 1.

$$S^{k+1}(x^{k+1}) = \begin{cases} S^k(x^k), & f^k(x^k) = r \\ 0, & \text{else} \end{cases}, \qquad S^1 = 1 \tag{2}$$
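The bookkeeping in Eqs. (1) and (2) can be illustrated with a short Python sketch that evaluates the system risk of a single example from its sequence of stage decisions. The names `system_risk`, `REJECT` and `acq_cost` are hypothetical and introduced only for illustration; stages are 0-indexed in the code, and `acq_cost[k]` is assumed to hold the penalty paid when stage `k` rejects.

```python
REJECT = "r"  # hypothetical marker for the reject decision


def system_risk(decisions, label, acq_cost):
    """Risk of one example under a sequence of stage decisions.

    decisions : per-stage outputs (a class label or REJECT),
                ordered from the first stage to the last
    label     : true class of the example
    acq_cost  : acq_cost[k] is the penalty paid when stage k (0-indexed)
                rejects, i.e. the cost of acquiring the next modality
    """
    if decisions[-1] == REJECT:
        raise ValueError("the terminal stage must classify")
    risk = 0.0
    for k, decision in enumerate(decisions):
        if decision == REJECT:
            risk += acq_cost[k]  # state variable stays at 1
            continue
        risk += 0.0 if decision == label else 1.0
        break  # state variable drops to 0, later stages are never reached
    return risk
```

An example that is rejected once and then classified correctly pays only the acquisition penalty, while an example misclassified at the first stage pays the unit error penalty and no acquisition cost.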



2.1 Bayesian Setting 

In this section, we will digress from the discriminative setting and analyze the problem
under the assumption that the underlying distribution $\mathcal{D}$ is known. In doing so, we hope to
discover some fundamental structure that will simplify our empirical risk formulation in the
next section.

If $\mathcal{D}$ is known the problem reduces to a POMDP, and the optimal strategy is to
minimize the expected risk,

$$\min_{f^1, \dots, f^K} E_{\mathcal{D}}\left[R(f^1, \dots, f^K, x, y)\right] \tag{3}$$

If we allow arbitrary decision functions then we can equivalently minimize conditional risk,

$$\min_{f^1, \dots, f^K} E\left[R(f^1, \dots, f^K, x, y) \mid x\right] \tag{4}$$

By appealing to dynamic programming, this problem reduces to a single
stage optimization problem for a modified risk function. To see this, we denote the
cost-to-go,

$$\tilde{\delta}^k(x^k) = \delta^{k+1} + \min_{f^{k+1}, \dots, f^K} E\left[\sum_{t=k+1}^{K} S^t(x^t) R_t(f^t, x^t, y) \;\middle|\; x^k, S^k(x^k) = 1\right]$$

and the modified risk functional,

$$\tilde{R}_k(x^k, y, f^k, \tilde{\delta}^k) = \begin{cases} \tilde{\delta}^k(x^k), & f^k(x^k) = r \\ 1, & f^k(x^k) \neq y \wedge f^k(x^k) \neq r \\ 0, & \text{otherwise} \end{cases}$$

and prove the following theorem,

¹ To simplify our discussion, we consider equal error penalties. However, our approach can be easily
extended to unbalanced error penalties, as we will demonstrate in the experiments section.
Theorem 1 The optimal solution $f^1, f^2, \dots, f^K$ to the multi-stage risk in Eq. 4 decomposes
to single stage optimization,

$$f^k = \arg\min_{f} E\left[\tilde{R}_k(x^k, y, f, \tilde{\delta}^k) \mid x^k\right] \tag{5}$$

and the solution is:

$$f^k(x^k) = \begin{cases} \hat{y}, & \hat{P}(x^k) \geq 1 - \tilde{\delta}^k(x^k) \\ \text{reject}, & \hat{P}(x^k) < 1 - \tilde{\delta}^k(x^k) \end{cases} \tag{6}$$

$$\hat{y} = \arg\max_{j} P(y = j \mid x^k), \qquad \hat{P}(x^k) = \max_{j} P(y = j \mid x^k)$$

Proof To simplify our derivations, we assume uniform class prior probability: $P_y[y = \hat{y}] = \frac{1}{C}$,
$\hat{y} = 1, \dots, C$. However, our results can be easily modified to account for a non-uniform
prior. The expected conditional risk can be solved optimally by a dynamic program, where
a DP recursion is,

$$J^K(x^K, S^K) = \min_{f^K} E_y\left[S^K(x^K) R_K(y, x^K, f^K) \mid x^K\right] \tag{7}$$

$$J^k(x^k, S^k) = \min_{f^k} \left\{ E_y\left[S^k(x^k) R_k(y, x^k, f^k) \mid x^k\right] + E_{x^{k+1} \dots x^K}\left[J^{k+1}(x^{k+1}, S^{k+1}) \mid x^k\right] \right\} \tag{8}$$

Consider the $k$th stage minimization. $f^k$ can take $C + 1$ possible values $\{1, 2, \dots, C, r\}$ and
$J^k(x^k, S^k)$ can be recast as a conditional expected risk minimization,

$$J^k(x^k, S^k = 1) = \min_{f^k} \Big\{ \underbrace{P_y[y \neq \hat{y} \mid x^k]}_{f^k(x^k) = \hat{y}},\; \underbrace{\delta^{k+1} + E_{x^{k+1} \dots x^K}\left[J^{k+1}(x^{k+1}, 1) \mid x^k\right]}_{f^k(x^k) = r} \Big\} \tag{9}$$

Define,

$$\tilde{\delta}^k(x^k) = \delta^{k+1} + E_{x^{k+1} \dots x^K}\left[J^{k+1}(x^{k+1}, S^{k+1} = 1) \mid x^k\right]$$

and rewrite the conditional risk in Eq. 9:

$$f^k = \arg\min_{f} \Big\{ \underbrace{1 - P_y[y = \hat{y} \mid x^k]}_{f(x^k) = \hat{y}},\; \underbrace{\tilde{\delta}^k(x^k)}_{f(x^k) = r} \Big\} \tag{10}$$

Reject is the optimal decision if,

$$\min_{\hat{y}} \left\{ 1 - P_y[y = \hat{y} \mid x^k] \right\} > \tilde{\delta}^k(x^k) \iff \max_{\hat{y}} \left\{ P_y[y = \hat{y} \mid x^k] \right\} < 1 - \tilde{\delta}^k(x^k) \tag{11}$$

If reject is not the optimal strategy then a class is chosen to maximize the posterior
probability:

$$f^k(x^k) = \arg\max_{\hat{y} \in \{1, \dots, C\}} \left\{ P_y[y = \hat{y} \mid x^k] \right\} \tag{12}$$

which is exactly our claim.
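The decision rule of Theorem 1 is easy to state in code once the posteriors and the cost-to-go are given. The following Python sketch assumes both quantities are available as plain numbers, which holds only in the Bayesian setting of this section; the function names `bayes_reject_decision` and `stage_risk` are hypothetical.

```python
def bayes_reject_decision(posteriors, cost_to_go):
    """Stage decision of Theorem 1 for one observation.

    posteriors : list with P(y = j | x^k) for each class index j
    cost_to_go : modified reject cost at x^k (a number below 1)
    Returns the index of the most probable class, or the string "reject".
    """
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    if posteriors[best] >= 1.0 - cost_to_go:
        return best
    return "reject"


def stage_risk(posteriors, cost_to_go):
    """Conditional risk achieved by the rule above: the smaller of the
    error probability of the best class and the cost of rejecting."""
    return min(1.0 - max(posteriors), cost_to_go)
```

With posteriors `[0.6, 0.4]` the example is rejected when the cost-to-go is 0.2, because guessing the best class would cost 0.4 in expectation, and it is classified when the cost-to-go rises to 0.5.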






The main implication of this result is that if the cost-to-go function $\tilde{\delta}^k(x^k)$ is known
then the risk $\tilde{R}_k(\cdot)$ is only a function of the current stage decision $f^k$. Therefore, we can
ignore all of the other stages and minimize a single stage risk. Effectively, we decomposed
the multi-stage problem in Eq. 4 into a stage-wise optimization in Eq. 5.

Note that the modified risk functional, $\tilde{R}_k$, is remarkably similar to $R_k$ except that the
modified reject cost $\tilde{\delta}^k(x^k)$ replaces the constant stage cost $\delta^{k+1}$. Also, consider the range for
which $\tilde{\delta}^k(x^k)$ is meaningful. If we have $C$ classes then a random guessing strategy would
incur an average risk of $1 - \frac{1}{C}$. Therefore the risk for rejecting must satisfy $\tilde{\delta}^k(x^k) < 1 - \frac{1}{C}$ in order to be
a meaningful option. The work in Chow (1970) contains a detailed analysis of the single stage
reject classifier in a Bayesian setting.



[Figure 4; axis label: $P(y = 1 \mid x^k)$]

Fig. 4 Optimal Reject Region can be expressed as the disagreement region of two binary classifiers ($f_n$ and $f_p$)


Reject Classifier As Two Binary Decisions: Consider a stage $k$ classifier with a reject option
from Theorem 1 in a binary classification setting, $y \in \{-1, +1\}$.

$$f^k(x^k) = \begin{cases} +1, & P(y = 1 \mid x^k) \geq 1 - \tilde{\delta}^k(x^k) \\ -1, & P(y = 1 \mid x^k) \leq \tilde{\delta}^k(x^k) \\ \text{reject}, & \tilde{\delta}^k(x^k) < P(y = 1 \mid x^k) < 1 - \tilde{\delta}^k(x^k) \end{cases} \tag{13}$$



It is clear from the expression that we can express the decision regions in terms of two
binary classifiers $f_n$ and $f_p$. Observe that for a given reject cost $\tilde{\delta}^k(x^k)$, the reject region is an
intersection of two binary decision regions. To this end we further modify the risk function
in terms of agreement and disagreement regions of the two classifiers, $f_n$, $f_p$, namely,

$$L_k(x^k, y, f_p, f_n, \tilde{\delta}^k) = \begin{cases} \tilde{\delta}^k(x^k), & f_n(x^k) \neq f_p(x^k) \\ 1, & f_n(x^k) = f_p(x^k) \wedge f_p(x^k) \neq y \\ 0, & \text{otherwise} \end{cases} \tag{14}$$



Note that the above loss function is symmetric between $f_n$ and $f_p$ and so any optimal solution
can be interchanged. Nevertheless, we claim:

Theorem 2 Suppose $f_n$ and $f_p$ are two binary classifiers that minimize $E\left[L_k(x^k, y, f_n, f_p, \tilde{\delta}^k) \mid x^k\right]$
over all binary classifiers $f_n$ and $f_p$. Then the resulting reject classifier:

$$f^k(x^k) = \begin{cases} f_p(x^k), & f_n(x^k) = f_p(x^k) \\ \text{reject}, & f_n(x^k) \neq f_p(x^k) \end{cases} \tag{15}$$

is the minimizer for $E\left[\tilde{R}_k(x^k, y, f, \tilde{\delta}^k) \mid x^k\right]$ in Theorem 1 and the $k$th stage minimizer in
Eq. 3.

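Proof For a given $x^k$ and $\tilde{\delta}^k(x^k)$,

$$\min_{f} E_y\left[\tilde{R}_k(x^k, y, f, \tilde{\delta}^k) \mid x^k\right] = \min \Big\{ \underbrace{P_y[y = -1 \mid x^k]}_{f = +1},\; \underbrace{P_y[y = +1 \mid x^k]}_{f = -1},\; \underbrace{\tilde{\delta}^k(x^k)}_{f = \text{reject}} \Big\}$$

$$\min_{f_p, f_n} E_y\left[L_k(x^k, y, f_p, f_n, \tilde{\delta}^k) \mid x^k\right] = \min_{f_p, f_n} \Big\{ \underbrace{P_y[y = -1 \mid x^k]}_{f_p = +1,\, f_n = +1},\; \underbrace{P_y[y = +1 \mid x^k]}_{f_p = -1,\, f_n = -1},\; \underbrace{\tilde{\delta}^k(x^k)}_{f_p \neq f_n} \Big\}$$

By inspection, the decomposition in Eq. 15 is the optimal Bayesian classifier minimizing
$E_y\left[\tilde{R}_k(x^k, y, f, \tilde{\delta}^k) \mid x^k\right]$.

□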

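We refer to Fig. 4 for an illustration. We can express the new loss compactly as follows:

$$L_k(x^k, y, f_p, f_n, \tilde{\delta}^k) = \mathbb{1}_{[f_p(x^k) \neq y]} \mathbb{1}_{[f_n(x^k) \neq y]} + \tilde{\delta}^k(x^k) \mathbb{1}_{[f_p(x^k) \neq f_n(x^k)]} \tag{16}$$

Note that in arriving at this expression we have used: $\mathbb{1}_{[a \neq c]} \mathbb{1}_{[a = b]} = \mathbb{1}_{[a \neq c]} \mathbb{1}_{[b \neq c]}$.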
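Because every quantity involved is binary, the equivalence between the piecewise two-classifier loss and its compact indicator form can be checked by enumeration. The Python sketch below does this under the assumption that the loss is zero when both classifiers agree on the correct label; the function names are hypothetical.

```python
from itertools import product


def loss_piecewise(y, fp, fn, cost_to_go):
    """Two-classifier loss written case by case."""
    if fp != fn:
        return cost_to_go  # disagreement: the example is rejected
    return 1.0 if fp != y else 0.0  # agreement: unit penalty on an error


def loss_compact(y, fp, fn, cost_to_go):
    """The same loss written with products of indicators."""
    return float(fp != y) * float(fn != y) + cost_to_go * float(fp != fn)


def forms_agree(cost_to_go=0.3):
    """True if both forms coincide on all 8 binary combinations."""
    return all(
        loss_piecewise(y, fp, fn, cost_to_go) == loss_compact(y, fp, fn, cost_to_go)
        for y, fp, fn in product((-1, 1), repeat=3)
    )
```

The check relies on the labels being binary: when the two classifiers disagree, exactly one of them equals the label, so the product of the two error indicators vanishes.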

In summary, in this section, we derive the optimal POMDP solution and decouple a
multi-stage risk to single stage optimization. Then, for the binary classification setting, we
derive an optimal representation for a reject region classifier in terms of two biased binary
decisions:

$$\min E\left[R(x, y, \dots, f^k, \dots)\right] \;\to\; \min E\left[\tilde{R}_k(x^k, y, f, \tilde{\delta}^k)\right] \;\to\; \min E\left[L_k(x^k, y, f_p^k, f_n^k, \tilde{\delta}^k)\right]$$



2.2 Stage-wise Empirical Minimization 

In this section, we assume that the probability model $\mathcal{D}$ is no longer known and cannot
be estimated due to the high dimensionality of the data. Instead, our task is to find multi-stage
decision rules based on a given training set: $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$. Here, we consider
the binary classification setting: $y_i \in \{+1, -1\}$.

We will take advantage of the stage-wise decomposition of the POMDP solution in
Theorem 1 and the parametrization of the reject region in Theorem 2 to formulate an empirical
version of the stage risk $L_k(\cdot)$ in Eq. 16. However, this requires the knowledge of the
cost-to-go, $\tilde{\delta}^k : \mathcal{X}^k \to \mathbb{R}$. Instead of trying to learn this complex function, we will define a
point-wise empirical estimate of the cost-to-go on the training data:

$$\tilde{\delta}^k(x_i^k) \to \tilde{\delta}_i^k, \qquad i = 1, 2, \dots, N$$

and use it to learn the decision boundaries directly.

Note that by definition, $\tilde{\delta}^k(x_i^k)$ is only a function of $f^{k+1}, \dots, f^K$. So the cost-to-go
estimate is conveniently defined by the recursion,

$$\tilde{\delta}_i^{k-1} = L_k(x_i^k, y_i, f_p^k, f_n^k, \tilde{\delta}_i^k) + \delta^k \tag{17}$$
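The backward recursion for the point-wise cost-to-go can be sketched in Python for a single training example. The sketch assumes hard (sign) decisions at every stage and assumes that the terminal stage, which cannot reject, contributes its 0/1 error; the names `pointwise_cost_to_go` and `acq_cost` are hypothetical, and `acq_cost[k]` stands for the constant cost of stage k.

```python
def two_classifier_loss(y, fp, fn, cost_to_go):
    """Stage loss for binary outputs fp, fn and label y."""
    return float(fp != y) * float(fn != y) + cost_to_go * float(fp != fn)


def pointwise_cost_to_go(y, stage_outputs, final_output, acq_cost):
    """Cost-to-go estimates for stages 1..K-1 of one training example.

    stage_outputs : list of (fp, fn) decisions for stages 1..K-1
    final_output  : decision of the terminal stage K
    acq_cost      : dict mapping stage index k (2..K) to its cost
    """
    K = len(stage_outputs) + 1
    # Base case (assumption): the last stage only pays for its own error.
    ctg = {K - 1: acq_cost[K] + float(final_output != y)}
    for k in range(K - 1, 1, -1):  # k = K-1, ..., 2
        fp, fn = stage_outputs[k - 1]  # decisions of stage k
        ctg[k - 1] = acq_cost[k] + two_classifier_loss(y, fp, fn, ctg[k])
    return [ctg[k] for k in range(1, K)]
```

For a three stage system in which the second stage rejects and the third stage is correct, the cost-to-go seen by the first stage is the sum of the two acquisition costs; if the second stage instead classifies wrongly, it is the second stage cost plus the unit error penalty.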
Now, we can form the empirical version of the risk in Eq. 3 and optimize for a solution
at stage $k$ over some family of functions, $\mathcal{F}^k$.

$$\{f_p^k(x^k), f_n^k(x^k)\} = \arg\min_{f_p, f_n \in \mathcal{F}^k} \frac{1}{N} \sum_{i=1}^{N} S_i^k L_k(x_i^k, y_i, f_p, f_n, \tilde{\delta}_i^k) \tag{18}$$

Observe that, as in the standard setting, we need to constrain the class of decision rules
$\mathcal{F}^k \times \mathcal{F}^k$ here. This is because with no constraints the minimum risk is equal to zero and
can be achieved in the first stage itself.

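Note, our stage-wise decomposition significantly simplifies the ERM. The objective in
Eq. 18 is only a function of $f_p^k$, $f_n^k$ given $\tilde{\delta}_i^k$ and the state $S_i^k$. To minimize an empirical
version of a multi-stage risk in Eq. 3 is much more difficult due to stage interdependencies.

Given $\tilde{\delta}_i^k$ and all the stages but the $k$th, we can solve Eq. 18 by iterating between $f_p^k$ and $f_n^k$.
To solve for $f_p^k$, we fix $f_n^k$ and minimize a weighted error

$$f_p^k = \arg\min_{f_p \in \mathcal{F}^k} \sum_{i=1}^{N} w_i \mathbb{1}_{[f_p(x_i^k) \neq y_i]}, \qquad w_i = S_i^k \left[ \mathbb{1}_{[f_n^k(x_i^k) \neq y_i]} + \tilde{\delta}_i^k - 2 \cdot \mathbb{1}_{[f_n^k(x_i^k) \neq y_i]} \tilde{\delta}_i^k \right] \tag{19}$$

We can solve for $f_n^k$ in the same fashion by fixing $f_p^k$,

$$f_n^k = \arg\min_{f_n \in \mathcal{F}^k} \sum_{i=1}^{N} w_i \mathbb{1}_{[f_n(x_i^k) \neq y_i]}, \qquad w_i = S_i^k \left[ \mathbb{1}_{[f_p^k(x_i^k) \neq y_i]} + \tilde{\delta}_i^k - 2 \cdot \mathbb{1}_{[f_p^k(x_i^k) \neq y_i]} \tilde{\delta}_i^k \right] \tag{20}$$

To derive these expressions from the loss in Eq. 16, we used another identity for any binary variables $a, b, c$:

$$\mathbb{1}_{[a \neq b]} = \mathbb{1}_{[a \neq c]} + \mathbb{1}_{[b \neq c]} - 2 \cdot \mathbb{1}_{[a \neq c]} \mathbb{1}_{[b \neq c]} \tag{21}$$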
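The step from the two-classifier loss to a weighted error for one classifier can be verified by enumeration. The Python sketch below checks the binary identity used in the derivation and confirms that, with the other classifier held fixed, the loss equals the weight times the error indicator plus a term that does not depend on the classifier being optimized. The function names are hypothetical.

```python
from itertools import product


def indicator(cond):
    return 1.0 if cond else 0.0


def weight_for_fp(y, fn, cost_to_go, state=1.0):
    """Per-example weight when solving for f_p with f_n held fixed."""
    miss_n = indicator(fn != y)
    return state * (miss_n + cost_to_go - 2.0 * miss_n * cost_to_go)


def identity_holds():
    """Check 1[a!=b] = 1[a!=c] + 1[b!=c] - 2*1[a!=c]*1[b!=c] on binary inputs."""
    return all(
        indicator(a != b)
        == indicator(a != c) + indicator(b != c)
        - 2.0 * indicator(a != c) * indicator(b != c)
        for a, b, c in product((-1, 1), repeat=3)
    )


def weighted_form_matches(cost_to_go=0.3):
    """Check loss = w * 1[f_p != y] + (a term independent of f_p)."""
    for y, fp, fn in product((-1, 1), repeat=3):
        loss = (indicator(fp != y) * indicator(fn != y)
                + cost_to_go * indicator(fp != fn))
        rebuilt = (weight_for_fp(y, fn, cost_to_go) * indicator(fp != y)
                   + cost_to_go * indicator(fn != y))
        if abs(loss - rebuilt) > 1e-12:
            return False
    return True
```

The weight takes only two values for an active example: the cost-to-go when the fixed classifier is correct, and one minus the cost-to-go when it is wrong.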



3 Algorithm 

Minimizing the indicator loss is a hard problem. Instead, we take the usual ERM (empirical
risk minimization) (Friedman et al., 2001) approach and replace it with a surrogate. We
introduce an algorithm in the boosting framework based on the analysis from the previous
section. Boosting is just one of many possible machine learning approaches that can be
used to solve it. We use boosting because it is easy to implement and is known to have good
performance.

Boosting is a way to combine simple classifiers to form a strong classifier. We are given
a set of such weak classifiers $\mathcal{H} = \{h_1(x), h_2(x), \dots, h_M(x)\}$, $h_j(x) \in \{-1, +1\}$. The strong
classifier is the linear combination:

$$F(x) = \mathrm{sgn}\left[\sum_{j=1}^{M} q_j h_j(x)\right]$$

This set of weak classifiers need not be finite. Also, denote $\mathcal{H}^k \subset \mathcal{H}$ as the subset of weak
classifiers that operate only on the first $k$ measurements of $x$: $h_j(x) = h_j(x^k)$ if $h_j \in \mathcal{H}^k$.
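A strong classifier restricted to the weak learners of one stage can be sketched in a few lines of Python. The sketch assumes decision stumps as weak learners and one scalar feature per measurement, neither of which is prescribed by the text; `make_stump`, `stage_subset` and `strong_classifier` are hypothetical names.

```python
def make_stump(feature, threshold):
    """Hypothetical weak learner: a decision stump on one measurement."""
    def h(x):
        return 1 if x[feature] > threshold else -1
    h.feature = feature  # index of the measurement the stump reads
    return h


def stage_subset(weak_learners, k):
    """Weak learners that only read the first k measurements."""
    return [h for h in weak_learners if h.feature < k]


def strong_classifier(x, learners, weights):
    """Sign of the weighted vote; a zero score is sent to +1 by assumption."""
    score = sum(q * h(x) for q, h in zip(weights, learners))
    return 1 if score >= 0 else -1
```

Restricting the vote to `stage_subset(learners, k)` guarantees that the stage k decision never touches a measurement that has not been acquired yet.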
Global Surrogate: In our algorithm, we use the sigmoid loss function $C(z) = \frac{1}{1 + \exp(z)}$ to
approximate the indicator. Similar sigmoid based losses have been used in boosting
before (Masnadi-Shirazi and Vasconcelos, 2009). Each subproblem (Eq. 19) reduces to boosting
a weighted loss.
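To solve for stage $k$, we keep the rest of the stages constant. To find $f_p^k = \sum_j q_j h_j(x)$, we
fix $f_n^k$ and solve:

$$f_p^k = \arg\min_{q} \sum_{i=1}^{N} w_i \, C\Big(y_i \sum_{h_j \in \mathcal{H}^k} q_j h_j(x_i)\Big) \tag{22}$$

Note that the weights $w_i$, state variables $S_i^k$ and cost-to-go $\tilde{\delta}_i^k$ are also expressed in terms of
$C(z)$ instead of $\mathbb{1}_{[z]}$:

$$w_i = S_i^k \left[ C(y_i f_n^k(x_i)) + \tilde{\delta}_i^k - 2 C(y_i f_n^k(x_i)) \tilde{\delta}_i^k \right] \tag{23}$$

To solve for $f_n^k$, we solve the same problem but keep $f_p^k$ constant instead:

$$f_n^k = \arg\min_{q} \sum_{i=1}^{N} w_i \, C\Big(y_i \sum_{h_j \in \mathcal{H}^k} q_j h_j(x_i)\Big) \tag{24}$$

$$w_i = S_i^k \left[ C(y_i f_p^k(x_i)) + \tilde{\delta}_i^k - 2 C(y_i f_p^k(x_i)) \tilde{\delta}_i^k \right]$$

Note that the terms $S_i^k$ and $\tilde{\delta}_i^k$ do not depend on stage $k$ and remain constant when solving for
$f_p^k$ and $f_n^k$. For ease of notation, we define a new term $C_r$ that indicates if $x_i$ is rejected
at the $k$th stage. The term is close to one if $f_p^k$ and $f_n^k$ disagree (reject) and small if they agree.

$$C_r(f_p^k, f_n^k, x_i^k, y_i) = C(y_i f_p^k(x_i)) + C(y_i f_n^k(x_i)) - 2 C(y_i f_p^k(x_i)) C(y_i f_n^k(x_i))$$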
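The behaviour of the smooth reject term is easy to inspect numerically. The Python sketch below assumes the sigmoid loss has the form 1 / (1 + exp(z)), so that it decreases in the margin and approximates the error indicator; the function names `sigmoid_loss` and `reject_surrogate` are hypothetical, and the arguments are the margins of the two classifiers on one example.

```python
import math


def sigmoid_loss(z):
    """Smooth stand-in for the error indicator 1[z < 0] (assumed form)."""
    return 1.0 / (1.0 + math.exp(z))


def reject_surrogate(margin_p, margin_n):
    """Smooth stand-in for 1[f_p != f_n], built from the binary identity
    1[a != b] = 1[a != c] + 1[b != c] - 2 * 1[a != c] * 1[b != c]."""
    cp = sigmoid_loss(margin_p)
    cn = sigmoid_loss(margin_n)
    return cp + cn - 2.0 * cp * cn
```

When one margin is strongly positive and the other strongly negative the term is close to one, and when both margins have the same sign, whether both correct or both wrong, it is close to zero.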

The expressions for state variables and cost-to-go are now simplified.

$$S_i^{k+1} = S_i^k \, C_r(f_p^k, f_n^k, x_i^k, y_i), \qquad S_i^1 = 1 \tag{25}$$

The state variable remains greater than zero as long as $x_i$ is rejected at every stage. The
expression for cost-to-go at the $k$th stage is:

$$\tilde{\delta}_i^k = \underbrace{\delta^{k+1}}_{\text{meas. cost}} + \underbrace{C(y_i f_p^{k+1}(x_i^{k+1})) \, C(y_i f_n^{k+1}(x_i^{k+1}))}_{\text{err. penalty if not rejected at stage } k+1} + \underbrace{\tilde{\delta}_i^{k+1} \, C_r(f_p^{k+1}, f_n^{k+1}, x_i^{k+1}, y_i)}_{\text{cost-to-go if rejected at stage } k+1} \tag{26}$$



The last two terms are simply a surrogate for $L_{k+1}$ from Eq. 16 in terms of $C(\cdot)$.
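The smooth state variables and cost-to-go estimates can be computed for one training example with a forward and a backward pass. The Python sketch below assumes the sigmoid loss 1 / (1 + exp(z)) and assumes that the terminal stage contributes only its own surrogate error; all function names and the `acq_cost` dictionary (stage index to constant stage cost) are hypothetical.

```python
import math


def sigmoid_loss(z):
    return 1.0 / (1.0 + math.exp(z))


def reject_term(margin_p, margin_n):
    cp, cn = sigmoid_loss(margin_p), sigmoid_loss(margin_n)
    return cp + cn - 2.0 * cp * cn


def soft_states(margins):
    """Forward pass: [S^1, ..., S^K] from the margins (y*f_p, y*f_n)
    of stages 1..K-1."""
    states = [1.0]
    for margin_p, margin_n in margins:
        states.append(states[-1] * reject_term(margin_p, margin_n))
    return states


def soft_cost_to_go(margins, final_margin, acq_cost):
    """Backward pass: dict k -> cost-to-go for k = 1..K-1."""
    K = len(margins) + 1
    # Base case (assumption): the terminal stage only pays its own error.
    ctg = {K - 1: acq_cost[K] + sigmoid_loss(final_margin)}
    for k in range(K - 2, 0, -1):
        margin_p, margin_n = margins[k]  # margins of stage k+1
        ctg[k] = (acq_cost[k + 1]
                  + sigmoid_loss(margin_p) * sigmoid_loss(margin_n)
                  + ctg[k + 1] * reject_term(margin_p, margin_n))
    return ctg
```

Changing one stage only invalidates the states of later stages and the cost-to-go of earlier stages, which is why the two passes run in opposite directions.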

For the last stage (a standard binary classifier), we fix the first $K - 1$ stages and solve:

$$f^K = \arg\min_{q} \sum_{i=1}^{N} S_i^K \, C\Big(y_i \sum_{h_j \in \mathcal{H}^K} q_j h_j(x_i)\Big) \tag{27}$$



Our algorithm performs cyclical optimization over the stages. To initialize $f_p^k$, $f_n^k$ $\forall k$,
we simply hard code $f_p^k$ to classify any $x$ as $+1$ and $f_n^k$ as $-1$ so that all $x$'s are rejected to the
last stage. Using these nominal classifiers, we compute $S_i^k$ and $\tilde{\delta}_i^k$ according to Eqs. 25
and 26 respectively.
At a stage $k$, for a fixed $\tilde{\delta}_i^k$ and $S_i^k$, we alternate among minimizing $f_p^k$ and $f_n^k$ according
to Eqs. 22 and 24. In practice, we found that one iteration is sufficient.

Given a new estimate of stage $k$, we update $S_i^s$ for $s > k$ and $\tilde{\delta}_i^s$ for $s < k$ and then move
on to optimizing another stage $k'$. Given an estimate for stage $k'$, we again update the state
variables and cost-to-go for the rest of the system.

The stages are optimized in the following order. We start with the last stage and make
our way backwards to the first stage. Then we do a forward pass from the 1st stage to the last. These
forward and backward passes are repeated until convergence. See Algorithm 1.



Algorithm 1 Global Algorithm

INPUT: $\{x_i, y_i\}_{i=1}^{N}$, $\{\mathcal{H}^k\}_{k=1}^{K}$ {weak learners for each stage}, $\{\delta^k\}_{k=1}^{K}$ {costs}, $D$ {loop iterations}
INITIALIZE: $f_p^k(x) \leftarrow +1$, $f_n^k(x) \leftarrow -1$ for $k = 1, \ldots, K-1$ {first $K-1$ stages reject everything}
for $d = 1, \ldots, D$ do
    for $k = K, \ldots, 1, 2, \ldots, K-1$ do
        {start from the last stage, iterate to the first stage, and then back toward the last stage}
        if $k < K$ then
            Find $f_p^k$ by solving the boosting subproblem in (22)
            Find $f_n^k$ by solving the boosting subproblem in (24)
        else if $k = K$ then
            {last stage}
            Find $f^K(x)$ by solving the boosting subproblem in (27)
        end if
        Update $S_i^s$ for $s > k$ and $\tilde{\delta}_i^s$ for $s < k$
    end for
end for

$$F^k(x^k) \leftarrow \begin{cases} \mathrm{sgn}[f_p^k(x^k)], & \text{if } \mathrm{sgn}[f_p^k(x^k)] = \mathrm{sgn}[f_n^k(x^k)] \\ \text{reject}, & \text{if } \mathrm{sgn}[f_p^k(x^k)] \neq \mathrm{sgn}[f_n^k(x^k)] \end{cases}$$

OUTPUT: $F^1, F^2, \ldots, F^K$
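As an illustration of how the output of Algorithm 1 would be used at test time, the following Python sketch applies the reject rule stage by stage and also generates the stage visiting order of one outer loop iteration. It is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, the classifier scores are assumed to be precomputed for every example, rejection is encoded as 0, and a score of exactly zero at the last stage is mapped to $+1$.

```python
import numpy as np

REJECT = 0  # assumed encoding of the "reject" decision

def stage_order(K):
    """Visiting order of one outer iteration: K, ..., 1, 2, ..., K-1."""
    return list(range(K, 0, -1)) + list(range(2, K))

def stage_decision(fp, fn):
    """Classify where sgn(f_p) and sgn(f_n) agree, otherwise reject."""
    sp, sn = np.sign(fp), np.sign(fn)
    return np.where(sp == sn, sp, REJECT).astype(int)

def cascade_predict(stage_scores, last_scores):
    """stage_scores: list of (f_p, f_n) score arrays for stages 1..K-1.
    last_scores: scores of the last-stage binary classifier.
    Returns predicted labels and the stage at which each example was classified."""
    n = last_scores.shape[0]
    labels = np.zeros(n, dtype=int)
    used = np.full(n, len(stage_scores) + 1)   # default: reached the last stage
    active = np.ones(n, dtype=bool)            # examples not yet classified
    for k, (fp, fn) in enumerate(stage_scores, start=1):
        d = stage_decision(fp, fn)
        done = active & (d != REJECT)
        labels[done] = d[done]
        used[done] = k
        active &= ~done
    labels[active] = np.where(last_scores[active] >= 0, 1, -1)
    return labels, used

# Two stage toy example: the third example is rejected to the second stage.
scores_stage1 = [(np.array([2.0, -1.0, 0.5]), np.array([1.0, -3.0, -0.5]))]
scores_stage2 = np.array([-1.0, 1.0, 0.7])
labels, used = cascade_predict(scores_stage1, scores_stage2)
```

The array `used` gives, per example, the last stage whose measurements were acquired, which is what determines the acquisition cost.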



Our formulation allows us to form a surrogate for the entire risk in Equation]!] not just 
for each subproblem. This enables us to prove the following theorem, 

Theorem 3 Our global surrogate algorithm converges to a local minimum.

Proof This is simply due to the fact that we are minimizing a global smooth cost function by
coordinate descent over $q_p^1, q_n^1, q_p^2, q_n^2, \ldots, q^K$. Here, $q_p^k$ is the vector of weak learner weights
parametrizing $f_p^k$. For the derivation of the global cost of a three stage system, refer to Section 8.

However, since the global loss and the loss for each subproblem are non-convex,
there is no global optimality guarantee. Theorem 3 ensures only that our algorithm terminates.



Regularization to reduce overfitting: To reduce overtraining, we introduce a simple but
effective regularization. For any loss $C(z)$ and a parameter $\lambda$, we introduce a multiplicative
term to the cost function: $\min_q \exp(\lambda|q|) \sum_{i=1}^{N} C\big(y_i \sum_{h_j \in \mathcal{H}} q_j h_j(x_i)\big)$. The term $\exp(\lambda|q|)$
limits how large a step size for a weak hypothesis can become. It also introduces a simple
stopping criterion: abort if $-\sum_{i=1}^{N} C'(y_i f_t(x_i))\, y_i h_{t+1}(x_i) \big/ \sum_{i=1}^{N} C(y_i f_t(x_i)) \le \lambda$. This corresponds to a
situation when no descent direction (weak hypothesis $h_{t+1}$) can be found to minimize the cost function.

4 Generalization Error

Our system is composed of margin maximizing classifiers; it is therefore appropriate to
derive generalization error bounds based on margins. It turns out that we can employ maximum
margin generalization techniques from [Bartlett et al., 1998] to derive error bounds for a two
stage version of the system. A two stage system consists of three boosted binary classifiers:

$$f_p(x^1) = \sum_{h_j \in \mathcal{H}_1} q_j^p h_j(x^1), \quad f_n(x^1) = \sum_{h_j \in \mathcal{H}_1} q_j^n h_j(x^1), \quad f_2(x^2) = \sum_{h_j \in \mathcal{H}_2} q_j^2 h_j(x^2)$$



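Theorem 4 Let $\mathcal{D}$ be a distribution on $\mathcal{X} \times \{+1,-1\}$, and let $S$ be a sample of $m$ examples
chosen independently at random according to $\mathcal{D}$, and $S_r$ a rejected subsample of size $m_r$,
$S_r = \{i \in S \mid f_p(x_i) \neq f_n(x_i)\}$. Assume that the base-classifier spaces $\mathcal{H}_1$ and $\mathcal{H}_2$ are finite,
and let $\delta > 0$. Then with probability at least $1 - \delta$ over the random choice of the training
set $S$, all boosted classifiers $f_n, f_p, f_2$ satisfy the following bound for all $\theta_1 > 0$ and $\theta_2 > 0$:

$$\begin{aligned}
& P_{\mathcal{D}}[y f_n(x) < 0,\ y f_p(x) < 0] + P_{\mathcal{D}}[y f_2(x) < 0,\ f_n(x) \neq f_p(x)] \le \\
& \quad P_S[y f_n(x) < \theta_1,\ y f_p(x) < \theta_1] + P_{S_r}[y f_2(x) < \theta_2] \\
& \quad + O\left(\frac{1}{\sqrt{m}}\left(\frac{\log m \log|\mathcal{H}_1|^2}{\theta_1^2} + \log\frac{1}{\delta}\right)^{1/2}\right) + O\left(\frac{1}{\sqrt{m_r}}\left(\frac{\log m_r \log|\mathcal{H}_2|}{\theta_2^2} + \log\frac{1}{\delta}\right)^{1/2}\right)
\end{aligned}$$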

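Proof The proof extends the approach in [Bartlett et al., 1998] to a two stage system. For
complete details, please refer to the appendix.

The two stage system can be compactly expressed as:

$$F(x) = \begin{cases} \mathrm{sgn}[f_p(x^1)], & \mathrm{sgn}[f_p(x^1)] = \mathrm{sgn}[f_n(x^1)] \\ \mathrm{sgn}[f_2(x^2)], & \mathrm{sgn}[f_p(x^1)] \neq \mathrm{sgn}[f_n(x^1)] \end{cases}$$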

The system error is a sum of two terms: error at the 1st stage + error at the 2nd stage.
Theorem 4 states that the generalization error of $F(x)$ is bounded by the empirical margin error
over the training set $S$ plus a term that is inversely proportional to the margins and the number
of training samples at that stage. An interesting observation is that $m_r$, the number of samples
that reach the 2nd stage, depends on the reject classifier at the 1st stage. So if very few
examples make it to the second stage, then we do not have strong generalization.



5 Experiments 

The goal is to demonstrate that a large fraction of data can be classified at an early stage 
using a cheap modality. In our experiments, we use four real life datasets with measurements 
arising from meaningful stages. 



5.1 Related Algorithms: 



We compare our algorithm to two methods:

Myopic: The absolute margin of a classifier is a measure of how confident the classifier is
on an example. Examples with a small margin have low confidence and should be rejected
to the next stage to acquire more features. This approach is based on reject classification
([Bartlett and Wegkamp, 2008]). We know from Claim 1 that the optimal classifier is a
threshold of the posterior. For each stage, we obtain a binary boosted classifier, $f^k(\cdot)$, trained
on all the data. We then threshold the margin of the classifier, $|f^k(x^k)|$. It is known that, given
an infinite amount of training data, boosting certain losses (sigmoid loss in our case)
approaches the log likelihood ratio, $f(x) = \frac{1}{2}\log\frac{P(y=1 \mid x)}{P(y=-1 \mid x)}$ ([Masnadi-Shirazi and Vasconcelos, 2009]).

So a reject region for a given threshold is defined as $\{x \mid |f^k(x)| < t_k\}$. This is a completely
myopic approach, as the rejection does not take into account the performance of later stages.
This method is very similar to TEFE ([Liu et al., 2008]), which also uses the absolute margin as
a measure for rejection. The difference is that our myopic strategy uses a boosting classifier
rather than the SVM used in TEFE.
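The myopic rule is simple enough to state as code. The sketch below is a minimal two stage illustration in Python, assuming the scores of the two boosted classifiers are already computed; the function name `myopic_two_stage` and the tie-breaking of a zero score toward $+1$ are assumptions made here for illustration.

```python
import numpy as np

def myopic_two_stage(f1, f2, t):
    """Reject to stage 2 when the absolute first-stage margin is below t.
    f1, f2: real-valued scores of the stage 1 and stage 2 classifiers.
    Returns the final predictions and the boolean reject mask."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    rejected = np.abs(f1) < t
    pred1 = np.where(f1 >= 0, 1, -1)
    pred2 = np.where(f2 >= 0, 1, -1)
    return np.where(rejected, pred2, pred1), rejected

# Toy example: the two low-margin examples are passed to the second stage.
pred, rejected = myopic_two_stage([2.0, -0.1, 0.3, -1.5], [1.0, 0.8, -0.6, 1.0], t=0.5)
```

Note that the threshold `t` only looks at the first-stage margin, which is exactly why the rule ignores how well the second stage would do on a rejected example.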

Expected Utility/Margin: An expected margin difference measures how useful a new attribute,
if acquired, would be for an example. If this expected utility for an example is
large, then a new attribute should be acquired. This approach is based on the work by
[Kanani and Melville, 2008]. We train boosted binary classifiers on all the data for each
stage: $f^k(x^k)$. Given the measurement at the current stage $x^k$, we compute an expected
utility (change in normalized margin) of acquiring the next measurement $x_{k+1}$:

$$U(x^k) = \sum_{x_{k+1} \in \mathcal{X}_{k+1}} \big| f^k(x^k) - f^{k+1}([x^k, x_{k+1}]) \big|\, P(x_{k+1} \mid x^k)$$

An $x^k$ is rejected to the next stage if its utility $U(x^k)$ is greater than a threshold. Here,
$\mathcal{X}_{k+1}$ denotes the possible values that $x_{k+1}$ can take. Note that this approach requires estimating
$P(x_{k+1} \mid x^k)$ (see footnote 2); therefore the $(k+1)$th measurement has to be discrete or its distribution
needs to be parametrized. Due to this limitation, we only compare this method on two datasets.
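For a discrete next measurement, the expected utility is a finite weighted sum and can be sketched as follows. This Python fragment is illustrative only: `expected_utility` is a hypothetical helper, the two classifiers are passed in as plain callables, and the conditional probabilities are assumed to be supplied by some separate estimator.

```python
def expected_utility(f_k, f_next, x_k, next_values, next_probs):
    """Expected absolute change in score from acquiring the next measurement.
    f_k(x_k): score of the current-stage classifier.
    f_next(x_k, v): score of the next-stage classifier if the new measurement equals v.
    next_values, next_probs: possible values of the next measurement and their
    estimated conditional probabilities given x_k."""
    current = f_k(x_k)
    return sum(p * abs(current - f_next(x_k, v))
               for v, p in zip(next_values, next_probs))

def reject_by_utility(f_k, f_next, x_k, next_values, next_probs, threshold):
    """Reject to the next stage when the expected utility exceeds the threshold."""
    return expected_utility(f_k, f_next, x_k, next_values, next_probs) > threshold

# Toy example with a three-valued next measurement.
u = expected_utility(lambda x: 0.2 * x,
                     lambda x, v: 0.2 * x + v,
                     1.0, [-1.0, 0.0, 1.0], [0.25, 0.5, 0.25])
```

In this toy case the next measurement shifts the score by $-1$, $0$ or $+1$, so the expected absolute change is $0.25 + 0 + 0.25 = 0.5$.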



5.2 Simulations 

Performance Metric: A natural performance metric is the trade-off between system error
and measurement cost. Note that, for the utility and myopic methods, it is unclear how to set
a threshold $t_k$ for each stage given a measurement cost $\delta^k$. For this reason, we only compare
them in a two stage system. More than two stages is not practical because we would
need to test every possible threshold $t_k$ for every stage $k$. In a two stage setting, measurement cost is
proportional to the fraction of examples rejected to the second stage. For our algorithm, we
vary the reject cost $\delta$ to generate a system error vs reject rate plot. For margin and utility, we
sweep the threshold $t$. System error is the sum of the 1st stage and 2nd stage errors. Reject rate
is the fraction of examples rejected to the 2nd stage, which require additional measurements.
A low reject rate (cost) corresponds to a higher error rate, as most of the data will be classified at
the first stage using less informative measurements. A high reject rate will have performance
similar to a centralized classifier, as most examples will be classified at the 2nd stage.
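The two quantities plotted against each other can be computed as in the following Python sketch. It assumes hard $\pm 1$ predictions from both stages and a boolean reject mask; the name `two_stage_metrics` is hypothetical.

```python
import numpy as np

def two_stage_metrics(y, pred1, pred2, rejected):
    """System error (1st stage error + 2nd stage error) and reject rate."""
    y, pred1, pred2 = map(np.asarray, (y, pred1, pred2))
    rejected = np.asarray(rejected, dtype=bool)
    err1 = np.mean(~rejected & (pred1 != y))   # misclassified at stage 1
    err2 = np.mean(rejected & (pred2 != y))    # misclassified at stage 2
    return err1 + err2, np.mean(rejected)

# Toy example: one of four examples is rejected; one stage 1 decision is wrong.
system_error, reject_rate = two_stage_metrics(
    y=[1, 1, -1, -1],
    pred1=[1, -1, -1, 1],
    pred2=[1, 1, -1, -1],
    rejected=[False, False, False, True],
)
```

Sweeping the reject cost (or the threshold, for the baselines) and recording these two numbers for each setting traces out one curve of system error against reject rate.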

2 While there are many different ways to estimate a probability likelihood, we used a Gaussian mixture due
to its computational efficiency.

Set Up: In all our experiments, we use stumps (see footnote 3) as weak learners. For each dataset and
experiment, we randomly split the data 50/50 for training and testing. The results are evaluated
on a separate test set, and the simulations are averaged over 50 Monte Carlo trials. The
number of iterations for each boosting subproblem is set to $T = 50$. In our global surrogate
algorithm, the number of outer loop iterations is set to $D = 10$.
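A decision stump, the weak learner used throughout, thresholds a single dimension of the input. The Python sketch below shows one possible way to evaluate a stump and to pick the one with the smallest weighted error by exhaustive search. It is an assumption-laden illustration rather than the authors' implementation: candidate thresholds are taken as midpoints between sorted unique feature values, and a value exactly at the threshold is assigned to the $+1$ side.

```python
import numpy as np

def stump_predict(X, d, g, s):
    """Stump on dimension d with threshold g and polarity s in {+1, -1}."""
    return s * np.where(X[:, d] - g >= 0, 1, -1)

def fit_stump(X, y, w):
    """Exhaustive search for the stump with the smallest weighted error.
    Returns (weighted_error, d, g, s)."""
    best = (np.inf, 0, 0.0, 1)
    for d in range(X.shape[1]):
        vals = np.unique(X[:, d])
        # Midpoints between consecutive values; a constant feature keeps its value.
        thresholds = (vals[:-1] + vals[1:]) / 2.0 if len(vals) > 1 else vals
        for g in thresholds:
            for s in (1, -1):
                err = np.sum(w * (stump_predict(X, d, g, s) != y))
                if err < best[0]:
                    best = (err, d, float(g), s)
    return best

# Toy example: one feature, perfectly separable at 1.5.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
w = np.full(4, 0.25)
err, d, g, s = fit_stump(X, y, w)
```

In a boosting subproblem the weights `w` would be the current example weights, so the same search returns the weak hypothesis for the next boosting iteration.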



Name              Size   1st Stage                       2nd Stage
Gaussian Mixture  1000   1st dim                         2nd dim
Mammogram Mass    830    3 CAD meas.                     Radiologist rating
Pima Diabetes     810    6 simple tests: BMI, sex, ..    2 blood tests
Polyps            310    12 freq. bins                   126 freq. bins
Threat            1300   Images in IR, PMMW              Images in AMMW

Table 1 Dataset Descriptions



Discrete Valued Data Experiments: To compare our method to the utility approach, we
consider discrete data. The first dataset is quantized (with 20 levels) Gaussian mixture
synthetic data in two dimensions. The 1st dimension is stage one; the 2nd dimension is stage
two. The second dataset is Mammogram Mass from the UCI Machine Learning Repository. It
is used to predict the severity of a mammographic mass lesion (malignant or benign). It
contains 3 attributes extracted from the CAD image and also an evaluation by a radiologist on a
confidence scale, in addition to the true biopsy results. The first stage consists of the features extracted
from the CAD image, and the second stage is the expert confidence rated on a discrete scale
from 1 to 5. Automatic analysis of the CAD image is cheaper than employing the opinion of a
radiologist.

Simulations in Fig. 5 demonstrate that utility performs worse than our
approach. This is possibly due to poor probability estimates in a limited data setting.

Fig. 5 Comparison of Global to Utility on (a) quantized two Gaussian clusters and (b) the mammogram dataset.
Reject Rate vs System Error. Reject Rate is the fraction of examples with measurements from both stages.
Our approach outperforms Utility, possibly because we do not need to estimate probability likelihoods.



3 A stump classifier is a threshold on the $d$th dimension: $h_{d,g,\{+1/-1\}}(x) = \{+1/-1\}\,\mathrm{sgn}(x(d) - g)$

Continuous Valued Data Experiments: We compare our global method to the myopic method
on three datasets. The Pima Indians Diabetes Dataset (UCI MLR) consists of 8 measurements.
Six of the measurements are inexpensive to acquire and consist of simple tests such
as body mass index, age, and pedigree. These we designate as the first stage. The other two
measurements constitute the second stage and require more expensive procedures.

The polyp dataset consists of hyper-spectral measurements of colon polyps collected
during colonoscopies ([Rodríguez-Díaz and Castañón, 2009]). The attribute is a measured
intensity at 126 equally spaced frequencies. Finer resolution requires a higher photon count,
which is proportional to acquisition time. For the first stage, we use a coarse measurement
downsampled to only 12 frequency bins. The second stage is the full resolution frequency
response. Using the coarse measurements is cheaper than acquiring the full resolution.

The threat dataset contains images taken of people wearing various explosive devices.
The imaging is done in three modalities: infrared (IR), passive millimeter wave (PMMW),
and active millimeter wave (AMMW). All the images are registered. We extract many patches
from the images and use them as our training data. A patch carries a binary label: it either
contains a threat or is clean. IR and PMMW are the fastest modalities but also the least
informative. AMMW requires raster scanning a person and is slow, but is also the most useful.

Fig. 6 Three datasets are evaluated: pima, polyps and threat. Reject Rate vs Error Rate for a varying reject
cost $\delta$. Reject Rate is the fraction of examples with measurements from both stages. Global and Myopic
are compared. Global (our approach) has better performance overall, while Myopic does better in some
situations.



Name              Centralized   Utility   Myopic   Ours
2D Gaussian Mix   0.09          50%       -        30%
Mammogram         0.165         60%       -        15%
Pima Diabetes     0.26          -         60%      45%
Polyps            0.24          -         75%      50%
Threat            0.185         -         50%      45%

Table 2 Performance illustration for different datasets (quantitative view of the curves). Datasets have 2
sensing modalities. Centralized denotes the test error obtained with all modalities. The last three columns denote
performance for the different approaches. Performance is measured by the average fraction of examples
requiring the 2nd stage to achieve error close to centralized. The Utility approach does not work for the last three
datasets due to high-dimensionality issues. We note the significant gains of our approach over competing ones
on many interesting datasets.

In Fig. 6, global performs better than margin in most cases. On the threat data, margin
appears to be doing just marginally worse than global; however, we get only a few points on
the curve with reject rates less than 50%. Due to the heuristic nature of margin, we cannot
construct a multistage classifier with an arbitrary reject rate.

The goal is to reach the performance of a centralized classifier (100% reject rate) while
utilizing the 2nd stage sensor only for a small fraction of examples. Overall, the results
demonstrate the benefit of multi-stage classification: the reject rate can be set to less than 50%
with only small sacrifices in performance. For the mammogram data, this implies that for
half of the patients a diagnosis can be made solely by an automatic analysis of a CAD image,
without the expensive opinion of a radiologist. For the Pima data, similar error can be
achieved without expensive medical procedures. For the polyps dataset, a fast low resolution
measurement is enough to classify a large fraction of patients. In the threat dataset,
IR and PMMW are sufficient to decide whether or not a threat is present for the majority of
instances, without requiring a person to go through the slower AMMW scanner.

Unbalanced False Positive and False Negative Penalties: In medical diagnosis and threat
detection, the penalties of false positives and false negatives are not equal. We can easily adapt
our algorithm to account for such a setting. The empirical risk in (18) can be modified to include
a penalty of $w_p$ for a Type I error and $w_n$ for a Type II error. The experiment in Fig. 7
demonstrates our global algorithm in such a scenario. For each reject cost $\delta$, we compute an
ROC curve. We also compute the corresponding average reject rate for each value of $\delta$. The
highest reject rate corresponds to the best performance but also to the highest acquisition
cost incurred by the system. Note that very good performance can be achieved by requesting
only 50% of instances to be measured at the second stage.




Fig. 7 Two Stage ROC using the global surrogate method. Each ROC curve corresponds to a different value of
the reject cost $\delta$. The legend displays the average reject rate for each $\delta$. Note that the red ROC corresponds to the
centralized system (100% reject rate). Very good performance can be achieved by requesting only 50% of instances to
be measured at the second stage.



Three Stages: Lastly, we demonstrate a three stage system by applying our algorithm to three
stages of the threat dataset. Note that it is unclear how to generalize margin to a multistage
scenario, as there is no way to define reject costs for different stages. We set the first stage
to be IR, the second PMMW, and the third AMMW. There is no cost for acquiring IR. We vary
the costs for the PMMW (2nd) stage, $\delta^2$, and the AMMW (3rd) stage, $\delta^3$, to generate an error map
(color in Fig. 8). A point on the map corresponds to the performance of a particular multistage
classification strategy. The vertical axis is the fraction of examples for which only IR and
PMMW measurements are used in making a decision. The horizontal axis is the fraction
of examples for which all three modalities are used. For example, a red point in the figure,
{.4, .15, .195}, corresponds to a system where 40% of examples use IR and PMMW, 15% use
only IR, and the rest of the data (45%) use all the modalities. This strategy achieves a system
error rate of 19.5%. Note that the support lies below the diagonal. This is because the sum
of the reject rates has to be less than one. The results demonstrate some interesting observations.
While the best performance (about 19%) is achieved when all the modalities are used for every
example, we can move along the vertical lines and allow a fraction to be classified by IR and
PMMW, avoiding AMMW altogether. This strategy achieves performance comparable to
a centralized system (IR+PMMW+AMMW).




Fig. 8 Three Stage System. The color maps error. A point on the map corresponds to the performance of a
particular multistage classification strategy. The vertical axis is the fraction of examples for which only IR
and PMMW measurements are used in making a decision. The horizontal axis is the fraction of examples
for which all three modalities are used. An example red point in the figure, {.4, .15, .195}, corresponds to a
system where 40% of examples use IR and PMMW, 15% use only IR, and the rest of the data (45%) use all the
modalities. This strategy achieves a system error rate of 19.5%.



6 Conclusion 

In this paper, we propose a general framework for a sequential decision system in a
non-parametric setting. Starting from basic principles, we derive the Bayesian optimal solution.
Then, to simplify the problem, we parameterize a classifier at each stage in terms of two
binary decisions. We formulate an ERM problem and optimize it by alternately minimizing
one stage at a time. Remarkably, all subproblems turn out to be weighted binary error
minimizations. We introduce a practical boosting algorithm that minimizes a global surrogate of
the empirical risk and test it on several datasets. Results show the advantage of our
formulation over more heuristic approaches. Overall, our experiments demonstrate how multi-stage
classifiers can achieve good performance by acquiring full measurements only for a fraction
of samples.

7 Appendix

7.1 Proof of Theorem 4



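Proof This will closely follow the proof of Theorem 1 in [Bartlett et al., 1998]. We have to
bound two terms:

$$P_{\mathcal{D}}[y f_n(x) < 0,\ y f_p(x) < 0] \quad \text{and} \quad P_{\mathcal{D}}[y f_2(x) < 0,\ f_n(x) \neq f_p(x)]$$

First Term Let us bound the first term. Define $\mathcal{C}_N$ to be the set of unweighted averages
over $N$ elements from $\mathcal{H}_1$,

$$\mathcal{C}_N = \Big\{ f : x \mapsto \frac{1}{N}\sum_{j=1}^{N} h_j(x) \ \Big|\ h_j \in \mathcal{H}_1 \Big\} \tag{30}$$

Any weighted classifier $f = \sum_h q_h h(x)$ can be approximated by drawing an element $g$ from $\mathcal{C}_N$,
choosing $h_1, \ldots, h_N$ with probability $q_h$. We write $g_p$ and $g_n$ for the elements drawn in this way
for $f_p$ and $f_n$, and $\mathcal{Q}$ for the resulting distribution over $\mathcal{C}_N$.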

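We can express our first term as a sum of probabilities of disjoint events.

\begin{align}
P_{\mathcal{D}}[y f_p(x) < 0,\ y f_n(x) < 0] = {} & \tag{31}\\
& P_{\mathcal{D}}\big[y f_p(x) < 0,\ y f_n(x) < 0,\ y g_p(x) < \tfrac{\theta_1}{2},\ y g_n(x) < \tfrac{\theta_1}{2}\big] \tag{32}\\
& + P_{\mathcal{D}}\big[y f_p(x) < 0,\ y f_n(x) < 0,\ y g_p(x) < \tfrac{\theta_1}{2},\ y g_n(x) \ge \tfrac{\theta_1}{2}\big] \tag{33}\\
& + P_{\mathcal{D}}\big[y f_p(x) < 0,\ y f_n(x) < 0,\ y g_p(x) \ge \tfrac{\theta_1}{2},\ y g_n(x) < \tfrac{\theta_1}{2}\big] \tag{34}\\
& + P_{\mathcal{D}}\big[y f_p(x) < 0,\ y f_n(x) < 0,\ y g_p(x) \ge \tfrac{\theta_1}{2},\ y g_n(x) \ge \tfrac{\theta_1}{2}\big] \tag{35}
\end{align}

Further, we can write,

\begin{align}
P_{\mathcal{D}}[y f_p(x) < 0,\ y f_n(x) < 0] \le {} & P_{\mathcal{D}}\big[y g_p(x) < \tfrac{\theta_1}{2},\ y g_n(x) < \tfrac{\theta_1}{2}\big] \tag{36}\\
& + P_{\mathcal{D}}\big[y f_p(x) < 0,\ y f_n(x) < 0,\ y g_p(x) \ge \tfrac{\theta_1}{2},\ y g_n(x) \ge \tfrac{\theta_1}{2}\big] \tag{37}
\end{align}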



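The inequality holds for any $g_p, g_n$. We take the expected value of the right hand side with respect to
the distribution $\mathcal{Q}$:

\begin{align}
P_{\mathcal{D}}[y f_p(x) < 0,\ y f_n(x) < 0] \le {} & \tag{38}\\
& E_{\mathcal{Q}}\Big[P_{\mathcal{D}}\big[y g_p(x) < \tfrac{\theta_1}{2},\ y g_n(x) < \tfrac{\theta_1}{2}\big]\Big] \tag{39}\\
& + E_{\mathcal{D}}\Big[P_{\mathcal{Q}}\big[y g_p(x) \ge \tfrac{\theta_1}{2},\ y g_n(x) \ge \tfrac{\theta_1}{2} \ \big|\ y f_p(x) < 0,\ y f_n(x) < 0\big]\Big] \tag{40}
\end{align}

The last term inside the expectation is the probability that an average of $N$ Bernoulli random
variables is larger than its expectation; we use a concentration result from Equation (4) in
Theorem 1 of [Bartlett et al., 1998]:

$$P_{\mathcal{Q}}\big[y g_p(x) \ge \tfrac{\theta_1}{2},\ y g_n(x) \ge \tfrac{\theta_1}{2} \ \big|\ y f_p(x) < 0,\ y f_n(x) < 0\big] \le \exp\left(\frac{-N\theta_1^2}{8}\right) \tag{41}$$

To bound the first term, we use the result from Equation (5) in Theorem 1 of [Bartlett et al., 1998].
If we set $\epsilon_N = \sqrt{(1/2m)\log\big((N+1)|\mathcal{H}_1|^{2N}/\delta_N\big)}$, then with probability at least $1 - \delta_N$,

$$E_{\mathcal{Q}}\Big[P_{\mathcal{D}}\big[y g_p(x) < \tfrac{\theta_1}{2},\ y g_n(x) < \tfrac{\theta_1}{2}\big]\Big] \le E_{\mathcal{Q}}\Big[P_S\big[y g_p(x) < \tfrac{\theta_1}{2},\ y g_n(x) < \tfrac{\theta_1}{2}\big]\Big] + \epsilon_N \tag{42}$$

for any choice of $\theta_1$ and every distribution $\mathcal{Q}$. Here, $P_S[\cdot]$ is the probability taken with respect to
a randomly drawn sample of size $m$ from $\mathcal{D}$.
By the same argument as in inequality (37),

\begin{align}
E_{\mathcal{Q}}\Big[P_S\big[y g_p(x) < \tfrac{\theta_1}{2},\ y g_n(x) < \tfrac{\theta_1}{2}\big]\Big] \le {} & \tag{43}\\
& P_S[y f_p(x) < \theta_1,\ y f_n(x) < \theta_1] + E_S\Big[P_{\mathcal{Q}}\big[y g_p(x) < \tfrac{\theta_1}{2},\ y g_n(x) < \tfrac{\theta_1}{2} \ \big|\ y f_p(x) \ge \theta_1,\ y f_n(x) \ge \theta_1\big]\Big] \tag{44}
\end{align}

The expressions inside the expectation can be bounded using the same Chernoff bound result
from (41):

$$P_{\mathcal{Q}}\big[y g_p(x) < \tfrac{\theta_1}{2},\ y g_n(x) < \tfrac{\theta_1}{2} \ \big|\ y f_p(x) \ge \theta_1,\ y f_n(x) \ge \theta_1\big] \le \exp\left(\frac{-N\theta_1^2}{8}\right) \tag{45}$$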


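By setting $\delta_N = \delta/(N(N+1))$ and combining the terms,

\begin{align}
P_{\mathcal{D}}[y f_p(x) < 0,\ y f_n(x) < 0] \le {} & \tag{46}\\
& P_S[y f_p(x) < \theta_1,\ y f_n(x) < \theta_1] + 2\exp\left(\frac{-N\theta_1^2}{8}\right) + \sqrt{\frac{1}{2m}\log\left(\frac{N(N+1)^2|\mathcal{H}_1|^{2N}}{\delta}\right)} \tag{47}
\end{align}

By setting $N = (4/\theta_1^2)\log\big(m/\log|\mathcal{H}_1|^2\big)$,

$$P_{\mathcal{D}}[y f_p(x) < 0,\ y f_n(x) < 0] \le P_S[y f_p(x) < \theta_1,\ y f_n(x) < \theta_1] + O\left(\frac{1}{\sqrt{m}}\left(\frac{\log m \log|\mathcal{H}_1|^2}{\theta_1^2} + \log\frac{1}{\delta}\right)^{1/2}\right) \tag{48}$$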



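Second Term Here we will bound the second term, $P_{\mathcal{D}}[y f_2(x) < 0,\ f_n(x) \neq f_p(x)]$.
Define a new distribution:

$$\mathcal{D}_r(x,y) = \begin{cases} c\,\mathcal{D}(x,y), & f_p(x) \neq f_n(x) \\ 0, & f_p(x) = f_n(x) \end{cases} \tag{49}$$

where $c$ is the normalizing constant. Rewrite:

\begin{align}
P_{\mathcal{D}}[y f_2(x) < 0,\ f_n(x) \neq f_p(x)] &\le P_{\mathcal{D}}[y f_2(x) < 0 \mid f_n(x) \neq f_p(x)] \tag{50}\\
&= P_{\mathcal{D}_r}[y f_2(x) < 0] \tag{51}
\end{align}

Note that $S_r$ is an i.i.d. sample from $\mathcal{D}_r$. Using Theorem 1 in [Bartlett et al., 1998],

$$P_{\mathcal{D}_r}[y f_2(x) < 0] \le P_{S_r}[y f_2(x) < \theta_2] + O\left(\frac{1}{\sqrt{m_r}}\left(\frac{\log m_r \log|\mathcal{H}_2|}{\theta_2^2} + \log\frac{1}{\delta}\right)^{1/2}\right)$$

Collecting the two terms produces the desired result.

8 Derivation of a global risk for a three stage system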

Consider a three stage system. Define some terms:

\begin{align}
\text{Error Indicator: } & \mathbb{1}[f(x) \neq y] \rightarrow C(y f(x)) = \frac{1}{1 + \exp(y f(x))} \tag{52}\\
\text{Reject Indicator: } & \mathbb{1}[f_p(x) \neq f_n(x)] \rightarrow \tag{53}\\
& C_r(f_p, f_n, x, y) = C(y f_p(x)) + C(y f_n(x)) - 2\,C(y f_p(x))\,C(y f_n(x)) \tag{54}
\end{align}

Risk for three stages:

\begin{align}
R(f_p^1, f_n^1, f_p^2, f_n^2, f^3, x, y) &= S^1 R^1 + S^2 R^2 + S^3 R^3 \tag{55}\\
S^1 &= 1 \tag{56}\\
S^2(f_p^1, f_n^1, x, y) &= C_r(f_p^1, f_n^1, x^1, y) \tag{57}\\
S^3(f_p^1, f_n^1, f_p^2, f_n^2, x, y) &= C_r(f_p^1, f_n^1, x^1, y)\, C_r(f_p^2, f_n^2, x^2, y) \tag{58}\\
R^1(f_p^1, f_n^1, x, y) &= C(y f_p^1(x^1))\, C(y f_n^1(x^1)) + \delta^2 C_r(f_p^1, f_n^1, x^1, y) \tag{59}\\
R^2(f_p^2, f_n^2, x, y) &= C(y f_p^2(x^2))\, C(y f_n^2(x^2)) + \delta^3 C_r(f_p^2, f_n^2, x^2, y) \tag{60}\\
R^3(f^3, x, y) &= C(y f^3(x^3)) \tag{61}
\end{align}

Plug in all the terms:

\begin{align}
R(\cdot) = {} & \underbrace{C(y f_p^1(x^1))\, C(y f_n^1(x^1)) + \delta^2 C_r(f_p^1, f_n^1, x^1, y)}_{S^1 R^1} \tag{63}\\
& + \underbrace{C_r(f_p^1, f_n^1, x^1, y)\,\big\{C(y f_p^2(x^2))\, C(y f_n^2(x^2)) + \delta^3 C_r(f_p^2, f_n^2, x^2, y)\big\}}_{S^2 R^2} \tag{64}\\
& + \underbrace{C_r(f_p^1, f_n^1, x^1, y)\, C_r(f_p^2, f_n^2, x^2, y)\, C(y f^3(x^3))}_{S^3 R^3} \tag{65}
\end{align}

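Minimize over $f_p^1, f_n^1$ and keep $f_p^2, f_n^2, f^3$ constant. We can rearrange the terms to get:

\begin{align}
& \arg\min_{f_p^1, f_n^1} \sum_i R(f_p^1, f_n^1, f_p^2, f_n^2, f^3, x_i, y_i) = \tag{66}\\
& \arg\min_{f_p^1, f_n^1} \sum_i C(y_i f_p^1(x_i^1))\, C(y_i f_n^1(x_i^1)) + \tilde{\delta}_i^1\, C_r(f_p^1, f_n^1, x_i^1, y_i) \tag{67}\\
& \text{such that:} \tag{68}\\
& \tilde{\delta}_i^1 = \delta^2 + \big\{C(y_i f_p^2(x_i^2))\, C(y_i f_n^2(x_i^2)) + \delta^3 C_r(f_p^2, f_n^2, x_i^2, y_i)\big\} \tag{69}\\
& \qquad + C_r(f_p^2, f_n^2, x_i^2, y_i)\, C(y_i f^3(x_i^3)) \tag{70}
\end{align}

Minimize over $f_p^2, f_n^2$ and keep $f_p^1, f_n^1, f^3$ constant:

\begin{align}
& \arg\min_{f_p^2, f_n^2} \sum_i R(f_p^1, f_n^1, f_p^2, f_n^2, f^3, x_i, y_i) = \tag{71}\\
& \arg\min_{f_p^2, f_n^2} \sum_i S_i^2 \big\{C(y_i f_p^2(x_i^2))\, C(y_i f_n^2(x_i^2)) + \tilde{\delta}_i^2\, C_r(f_p^2, f_n^2, x_i^2, y_i)\big\} \tag{72}\\
& \text{such that:} \tag{73}\\
& S_i^2 = C_r(f_p^1, f_n^1, x_i^1, y_i) \tag{74}\\
& \tilde{\delta}_i^2 = \delta^3 + C(y_i f^3(x_i^3)) \tag{75}
\end{align}

Minimize over $f^3$ and keep $f_p^1, f_n^1, f_p^2, f_n^2$ constant:

\begin{align}
& \arg\min_{f^3} \sum_i R(f_p^1, f_n^1, f_p^2, f_n^2, f^3, x_i, y_i) = \tag{76}\\
& \arg\min_{f^3} \sum_i S_i^3\, C(y_i f^3(x_i^3)) \tag{77}\\
& \text{such that:} \tag{78}\\
& S_i^3 = C_r(f_p^1, f_n^1, x_i^1, y_i)\, C_r(f_p^2, f_n^2, x_i^2, y_i) \tag{79}
\end{align}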
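The rearrangements in this section can be checked numerically. The Python sketch below is a sanity check under stated assumptions, not part of the original derivation: it assumes the sigmoid surrogate $C(z) = 1/(1+e^{z})$, draws random margins $y f(x)$ for every classifier, uses arbitrary illustrative costs, and verifies that the three stage risk equals the stage 1 objective and differs from the stage 2 and stage 3 objectives only by terms that do not depend on the stage being optimized.

```python
import numpy as np

def C(z):
    """Sigmoid surrogate of the error indicator at the margin z = y * f(x)."""
    return 1.0 / (1.0 + np.exp(z))

def C_r(zp, zn):
    """Surrogate of the reject indicator from the margins of f_p and f_n."""
    cp, cn = C(zp), C(zn)
    return cp + cn - 2.0 * cp * cn

rng = np.random.default_rng(0)
n = 5
# Random margins y*f(x) for f_p^1, f_n^1, f_p^2, f_n^2 and f^3 (illustrative only).
zp1, zn1, zp2, zn2, z3 = rng.normal(size=(5, n))
d2, d3 = 0.1, 0.3  # hypothetical acquisition costs of stages 2 and 3

Cr1, Cr2 = C_r(zp1, zn1), C_r(zp2, zn2)
R1 = C(zp1) * C(zn1) + d2 * Cr1
R2 = C(zp2) * C(zn2) + d3 * Cr2
R3 = C(z3)
risk = 1.0 * R1 + Cr1 * R2 + Cr1 * Cr2 * R3       # S^1 R^1 + S^2 R^2 + S^3 R^3

# Stage 1 objective with its cost-to-go.
ctg1 = d2 + C(zp2) * C(zn2) + d3 * Cr2 + Cr2 * C(z3)
stage1_obj = C(zp1) * C(zn1) + ctg1 * Cr1

# Stage 2 objective: state variable times (error term + cost-to-go * reject term).
ctg2 = d3 + C(z3)
stage2_obj = Cr1 * (C(zp2) * C(zn2) + ctg2 * Cr2)

# Stage 3 objective: state variable times the last-stage error surrogate.
stage3_obj = Cr1 * Cr2 * C(z3)

stage1_matches = np.allclose(risk, stage1_obj)
stage2_matches = np.allclose(risk - stage2_obj, R1)             # remainder free of stage 2
stage3_matches = np.allclose(risk - stage3_obj, R1 + Cr1 * R2)  # remainder free of stage 3
```

All three flags evaluate to `True`, which is the per-example statement that minimizing each subproblem is equivalent to minimizing the global surrogate risk over that stage with the others held fixed.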



